Using Compression For Source Based Classification Of Text

Authors

  • Nitin Thaper
  • Arthur C. Smith
Abstract

This thesis addresses the problem of source based text classification. In a nutshell, this problem involves classifying documents according to “where they came from” instead of the usual “what they contain”. Viewed from a machine learning perspective, this can be looked upon as a learning problem and can be classified into two categories: supervised and unsupervised learning. In the former case, the classifier is presented with known examples of documents and their sources during the training phase. In the testing phase, the classifier is given a document whose source is unknown, and the goal of the classifier is to find the most likely one from the category of known sources. In the latter case, the classifier is just presented with samples of text, and its goal is to detect regularities in the data set. One such goal could be a clustering of the documents based on common authorship. In order to perform these classification tasks, we intend to use compression as the underlying technique. Compression can be viewed as a predict-encode process where the prediction of upcoming tokens is done by adaptively building a model from the text seen so far. This source modelling feature of compression algorithms allows for classification by purely statistical means.

Thesis Supervisor: Shafi Goldwasser
Title: RSA Professor of Computer Science and Engineering
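The supervised setting described in the abstract can be illustrated with a minimal sketch. It assumes a general-purpose compressor (Python's `zlib`) as a stand-in for whatever compression scheme the thesis actually uses, which the abstract does not name. The function names `compressed_size`, `cross_cost`, and `classify`, as well as the toy training texts, are hypothetical and chosen only for illustration. The idea is that a document costs fewer extra bytes to encode when the compressor has already adapted to text from the same source.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Number of bytes zlib needs to encode `data` at maximum compression."""
    return len(zlib.compress(data, 9))


def cross_cost(training: bytes, document: bytes) -> int:
    """Extra bytes needed to encode `document` after the compressor
    has already seen `training` (an assumed proxy for model fit)."""
    return compressed_size(training + document) - compressed_size(training)


def classify(document: str, sources: dict[str, str]) -> str:
    """Return the name of the known source whose training text makes
    `document` cheapest to compress."""
    doc = document.encode("utf-8")
    return min(
        sources,
        key=lambda name: cross_cost(sources[name].encode("utf-8"), doc),
    )


# Toy training data (hypothetical): one text sample per known source.
sources = {
    "fox": "the quick brown fox jumps over the lazy dog. " * 20,
    "lorem": "lorem ipsum dolor sit amet consectetur adipiscing elit. " * 20,
}

# A document of unknown origin that shares phrasing with the "fox" source.
unknown = "the quick brown fox jumps over the lazy dog. the lazy dog sleeps."
label = classify(unknown, sources)
print(label)
```

With realistic corpora the training samples would be much longer, and a compressor with a stronger adaptive model would likely separate sources better than `zlib`, whose match window is limited to 32 KB.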


Similar resources

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...


Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

Text Classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents in predefined categories based on documents’ contents and labeled-training samples. Since word detection is a difficult and time consuming task in Persian language, Bayesian text classifier is an appropriate approach to deal with different...



Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA

With the explosive growth in amount of information, it is highly required to utilize tools and methods in order to search, filter and manage resources. One of the major problems in text classification relates to the high dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of features space. There are many feature selection methods. However...


Publication date: 2001